378 8.5  Advanced In Silico Analysis Tools

acid residues, of a molecule with unknown structure to generate an estimate for that

structure, known as the target, onto a similar structure of a molecule known as the template,

which is of the same or similar structural family. The algorithms are similar to those used

in BLAST.

Protein structures are far more conserved than protein sequences among such homologs,

perhaps for reasons of convergent evolution: similar tertiary structures evolve from different

primary structures, which imparts a selective advantage on the organism (but note that

primary sequences with less than 20% sequence identity often belong to different structural

families). Protein structures are in general more conserved than nucleic acid structures.

Alignment in the homology fits is better in regions where distinct secondary structures are

formed (e.g., α-helices, β-sheets; see Chapter 2) and is correspondingly poorer in random coil

regions.

8.5.6  STEP DETECTION

An increasingly important computational tool is the automation of the detection of steps

in noisy data. This is especially valuable for data output from experimental single-molecule

biophysics techniques. The effective signal-to-noise ratio is often small, so the distinction

between real signal and noise is often challenging to make and needs to be made objectively.

Also, single-molecule events are inherently stochastic and depend upon the underlying

probability distribution for their occurrence, which can often be far from simple. Thus, it is

important to acquire significant volumes of signal data, and therefore, an automated method

to extract real signal events is useful. The transition times between different molecular states

are often short compared to sampling time intervals such that a “signal” is often manifested

as a steplike change as a function of time in some physical output parameter, for example,

nanometer-level steps in rapid domain unfolding events of a protein stretched under force

(see Chapter 6), picoampere-level steps in current in the rapid opening and closing of an

ion channel in a cell membrane (see Chapter 5), and rapid steps in brightness due to

photobleaching events of single dye molecules (see the previous text).

One of the simplest and most robust ways to detect steps in a noisy, extended time series

is to apply a running window filter that preserves the sharpness and position of a step

edge. A popular method uses a simple median filter, which runs a window of

n consecutive data points along the time series such that the running output is

the median of the data points included in the window. A common method to deal with

a problem of potentially having 2n fewer data points in the output than in the raw data

is to reflect the first and last n data points to the beginning and end of the time series.

Another method uses the Chung–Kennedy filter, which consists

of two adjacent windows of size n run across the data such that the output at each point

is the mean value from whichever of the two windows has the smaller

variance (Figure 8.10). The logic here is that if one window encapsulates a step event, then

the variance in that window is likely to be higher.
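
The two filters described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes NumPy is available, the function names `median_filter` and `chung_kennedy_filter` are hypothetical, the median filter uses an odd window and reflects half a window of points at each end (one possible padding choice), and the Chung–Kennedy version is the simple two-window form described here, with the backward and forward windows both excluding the current point.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter(x, n):
    """Running median over an odd window of n points (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    half = n // 2
    # Reflect the ends so the output has the same length as the input.
    padded = np.pad(x, half, mode="reflect")
    windows = sliding_window_view(padded, n)  # shape (len(x), n)
    return np.median(windows, axis=1)


def chung_kennedy_filter(x, n):
    """Two-window edge-preserving mean filter (simplified sketch).

    For each point, compare the n points before it with the n points
    after it and output the mean of whichever window has the smaller
    sample variance.
    """
    x = np.asarray(x, dtype=float)
    if n < 2:
        raise ValueError("n must be at least 2 to estimate a variance")
    size = x.size
    padded = np.pad(x, n, mode="reflect")
    windows = sliding_window_view(padded, n)  # windows[j] = padded[j:j + n]
    means = windows.mean(axis=1)
    variances = windows.var(axis=1, ddof=1)
    # Sample i sits at padded index i + n:
    #   backward window = padded[i : i + n]              -> windows[i]
    #   forward window  = padded[i + n + 1 : i + 2n + 1] -> windows[i + n + 1]
    back = slice(0, size)
    fwd = slice(n + 1, n + 1 + size)
    use_back = variances[back] <= variances[fwd]
    return np.where(use_back, means[back], means[fwd])
```

As a usage example, a synthetic trace such as `np.concatenate([np.zeros(100), 5 * np.ones(100)])` with added Gaussian noise can be passed to either function with a window of about 10 points; the window should stay shorter than the typical interval between steps.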

Both median and Chung–Kennedy filters converge on the same expected value, though the

sample variance on the expected value of a mean distribution (i.e., the square of the standard

error of the mean) is actually marginally smaller than that of a median distribution; the

variance on the expected value from a sampled median distribution is σ²π/(2n) (for normally

distributed noise and large n), which compares with σ²/n for the sample mean (students

seeking solace in high-level statistical theory should see Mood et al., 1974), so the error on

the expected median value will be larger by a factor of

~√(π/2) or ~25%. There is therefore an advantage in using the Chung–Kennedy filter. Both

filters require that the window size n is less than the typical interval between step events; otherwise,

the window encapsulates multiple steps and generates nonsensical outputs, so ideally some prior

knowledge of likely stepping rates is required. These edge-preserving filters improve

the signal-to-noise ratio for noisy time series by a factor of ~√n. The decision of whether a

putative step event is real or not can be made on the basis of the probability of the observed

size of the putative step in light of the underlying noise. One way to achieve this is to perform

a Student’s t-​test to examine if the mean values of the data, <x>, on either side over a window